Feature  Reduction  and  Hierarchy  of  Classifiers 
for  Fast  Object  Detection  in  Video  Images 


Bernd  Heisele^  Thomas  Serre^  Sayan  Mukherjee^  Tomaso  Poggio^ 
^Center  for  Biological  and  Computational  Learning,  M.I.T. 
Cambridge,  MA,  USA 

^Honda R&D Americas, Inc., Boston, MA, USA
{heisele,serre,sayan,tp}@ai.mit.edu


Abstract 

We present a two-step method to speed up object detection systems in
computer vision that use Support Vector Machines (SVMs) as classifiers.
In the first step we perform feature reduction by choosing relevant image
features according to a measure derived from statistical learning theory.
In the second step we build a hierarchy of classifiers. On the bottom
level, a simple and fast classifier analyzes the whole image and rejects
large parts of the background. On the top level, a slower but more
accurate classifier performs the final detection. Experiments with a face
detection system show that combining feature reduction with hierarchical
classification leads to a speed-up by a factor of 170 with similar
classification performance.

1  Introduction 

Most object detection tasks in computer vision are computationally
expensive because of a) the large amount of input data that has to be
processed and b) the use of complex classifiers that are robust against
pose and illumination changes. Speeding up the classification is therefore
of major concern when developing systems for real-world applications. In
the following we investigate two methods for speed-ups: feature reduction
and hierarchical classification.

In [3] we presented a system for detecting frontal and near-frontal views
of faces in still gray images. The system achieved high detection accuracy
by classifying 19x19 gray patterns using a non-linear SVM. However,
searching an image for faces at different scales took several minutes on a
PC, far too long for most real-world applications. One way to speed up
detection is to reduce the number of features.

There are basically two types of feature selection methods in the
literature: filter and wrapper methods [1]. Filter methods are
preprocessing steps performed independently of the classification
algorithm or its error criteria; PCA is an example of a filter method.
Wrapper methods attempt to search through the space of feature subsets
using the criterion of the classification algorithm to select the optimal
feature subset. Wrapper methods can provide more accurate solutions than
filter methods [5], but in general are more computationally expensive. We
present a new wrapper method to reduce the dimensions of both input and
feature space of an SVM. An alternative approach for speeding up SVM
classification, based on reducing the number of support vectors, has been
proposed in [7].

Feature reduction is a generic tool that can be applied to any
classification problem. When dealing with a specific classification task
we can use prior knowledge about the type of data to speed up
classification. Two assumptions hold for most vision-based object
detection tasks: a) the vast majority of the analyzed patterns in an image
belong to the background class and b) most of the background patterns can
be easily distinguished from the objects. Based on these two assumptions
it is sensible to apply a hierarchy of classifiers. Fast classifiers
remove large parts of the background on the bottom and middle levels of
the hierarchy and a more accurate but slower classifier performs the final
detection on the top level. This idea falls into the framework of
coarse-to-fine template matching [8, 2] and is also related to
biologically motivated work on attention-based vision [4].

More recently a cascade of linear classifiers that have been trained using
AdaBoost has been proposed in [12] for frontal face detection. This idea
is related to ours in the sense that it combines hierarchical
classification with feature selection. However, in our approach the
complexity of the classifiers in the hierarchy is not only controlled by
the number of features (image resolution) but also by the class of
decision functions (i.e. class of SVM kernel functions). The bottom level
of our hierarchy consists of a linear classifier that operates on low
resolution patterns (9x9) while the top level consists of a non-linear
classifier operating on higher resolution patterns (19x19).

In Section 2 we give a brief overview of SVM theory and describe the
training and test data used in our experiments. In Section 3 we rank and
select features in the input space. Feature selection in the feature space
of the classifier is described in Section 4. In Section 5 we present the
hierarchical structure of classifiers. The paper is concluded in
Section 6.

2  Background 

2.1  Support  Vector  Machine  Theory 

An SVM [11] performs pattern recognition for a two-class problem by
determining the separating hyperplane that has maximum distance to the
closest points of the training set. These closest points are called
support vectors. To perform a non-linear separation in the input space, a
non-linear transformation $\Phi(\cdot)$ maps the data points $x$ of the
input space $\mathbb{R}^n$ into a high-dimensional space, called feature
space, $\mathbb{R}^p$ ($p > n$). The mapping $\Phi(\cdot)$ is represented
in the SVM classifier by a kernel function $K(\cdot, \cdot)$ which defines
an inner product in $\mathbb{R}^p$. Given $\ell$ examples
$\{(x_i, y_i)\}_{i=1}^{\ell}$, the decision function of the SVM is linear
in the feature space and can be written as:

$$f(x) = w \cdot \Phi(x) + b = \sum_{i=1}^{\ell} \alpha_i^0 y_i K(x_i, x) + b. \tag{1}$$

The optimal hyperplane is the one with the maximal distance (in space
$\mathbb{R}^p$) to the closest points $\Phi(x_i)$ of the training data.
Determining that hyperplane leads to maximizing the following functional
with respect to $\alpha$:

$$W^2(\alpha) = 2 \sum_{i=1}^{\ell} \alpha_i - \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{2}$$

under constraints $\sum_{i=1}^{\ell} \alpha_i y_i = 0$ and
$C \ge \alpha_i \ge 0$, $i = 1, \ldots, \ell$.

An upper bound on the expected error probability $EP_{err}$ of an SVM
classifier is given by [11]:

$$EP_{err} \le \frac{1}{\ell} E\left\{\frac{R^2}{M^2}\right\} = \frac{1}{\ell} E\left\{R^2 W^2(\alpha^0)\right\} \tag{3}$$

where $M = 1/W(\alpha^0)$ is the distance between the support vectors and
the separating hyperplane and $R$ is the radius of the smallest sphere
including all points $\Phi(x_1), \ldots, \Phi(x_\ell)$ of the training
data in the feature space. In the following, we will use this bound on the
expected error probability to rank and select features.

2.2  Computational  Issues 

The only non-linear kernel investigated in this paper is a second-degree
polynomial kernel $K(x, y) = (1 + x \cdot y)^2$, which has been
successfully applied to various object detection tasks [6, 3]. Eq. (1)
shows two ways of computing the decision function. When using the kernel
representation on the right side of Eq. (1), the number of multiplications
required to calculate the decision function for a second-degree polynomial
kernel is:

$$M_{k,\mathrm{poly2}} = (n + 2) \cdot s, \tag{4}$$

where $n$ is the dimension of the input space and $s$ is the number of
support vectors. This number is independent of the dimensionality of the
feature space. It depends on the number of support vectors, which grows
linearly with the size $\ell$ of the training data [11]. On the other
hand, the computation of the decision function in the feature space is
independent of the number of training samples; it depends only on the
dimensionality $p$ of the feature space. For the second-degree polynomial
kernel the feature space $\mathbb{R}^p$ has dimension
$p = \frac{(n+3)n}{2}$ and is given by

$$x^* = (\sqrt{2}x_1, \ldots, \sqrt{2}x_n, x_1^2, \ldots, x_n^2, \sqrt{2}x_1 x_2, \ldots, \sqrt{2}x_{n-1} x_n).$$

Thus the number of multiplications required for projecting the input
vector into the feature space and for computing the decision function is:

$$M_{f,\mathrm{poly2}} = \frac{(n+1)n}{2} + \frac{(n+3)n}{2} = (n + 2) \cdot n \tag{5}$$

From Eqs. (4) and (5) we see that the computation for an SVM with a
second-degree polynomial kernel is more efficiently done in the feature
space if the number of support vectors is bigger than $n$. This was always
the case in our experiments; the number of support vectors was between 2
and 6 times larger than $n$. That is why we investigated not only methods
for reducing the number of input features but also methods for feature
reduction in the feature space.
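
The two evaluation routes of Eq. (1) can be illustrated with a minimal sketch. It is not the code used in the experiments: the function names are hypothetical, `coeffs[i]` stands for the product $\alpha_i^0 y_i$, and the constant term of the kernel is handled separately because the explicit map $x^*$ above omits it.

```python
import math


def poly2_kernel(x, y):
    """Second-degree polynomial kernel K(x, y) = (1 + x . y)^2."""
    return (1.0 + sum(a * b for a, b in zip(x, y))) ** 2


def poly2_feature_map(x):
    """Explicit map x* of dimension p = n(n+3)/2 (constant term omitted)."""
    n = len(x)
    root2 = math.sqrt(2.0)
    linear = [root2 * xi for xi in x]
    squares = [xi * xi for xi in x]
    cross = [root2 * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return linear + squares + cross


def decision_kernel(x, support_vectors, coeffs, b):
    """Right-hand side of Eq. (1): one kernel evaluation per support vector."""
    return sum(c * poly2_kernel(sv, x)
               for sv, c in zip(support_vectors, coeffs)) + b


def weight_vector(support_vectors, coeffs):
    """w = sum_i coeffs[i] * Phi(x_i); computed once, after training."""
    w = [0.0] * len(poly2_feature_map(support_vectors[0]))
    for sv, c in zip(support_vectors, coeffs):
        for k, value in enumerate(poly2_feature_map(sv)):
            w[k] += c * value
    return w


def decision_feature_space(x, w, coeffs, b):
    """w . Phi(x) + b, evaluated in the feature space.

    The kernel's constant term contributes sum(coeffs); for a trained SVM
    this is zero because of the constraint sum_i alpha_i y_i = 0.
    """
    phi = poly2_feature_map(x)
    return sum(wk * pk for wk, pk in zip(w, phi)) + sum(coeffs) + b


def mults_kernel(n, s):
    """Multiplication count of Eq. (4)."""
    return (n + 2) * s


def mults_feature_space(n):
    """Multiplication count of Eq. (5)."""
    return (n + 2) * n
```

Both routes return the same value up to rounding; the feature-space route needs fewer multiplications whenever `s > n`, which is the comparison made in the text.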

2.3  Training  and  test  sets 

In our experiments we used one training and two test sets. The positive
training set contained 2,429 19x19 faces. The negative training set
contained 4,548 randomly selected non-face patterns. The relatively small
size of the training set affects the classification performance.
Experiments in [3] show that for a given classifier the false positive
rate can be reduced by a factor of 10 by increasing the training set using
bootstrapping methods. In this paper the main goal was to speed up a given
classifier without loss of classification performance. We opted for a
small training set in order to save time when training classifiers in our
numerous experiments.

The test set was extracted from the CMU test set 1^1. We extracted 472
faces and 23,570 non-face patterns. The non-face patterns were selected by
a linear SVM classifier as the non-face patterns most similar to faces.
The final evaluation of our system was performed on the entire CMU test
set 1, containing 118 images. Processing all images at different scales
resulted in about 57,000,000 analyzed 19x19 windows.

^1 The test set is a subset of the CMU test set 1 [9] which consists of
130 images and 507 faces. We excluded 12 images containing line-drawn
faces and non-frontal faces.

3  Dimension  reduction  in  the  Input  Space 
3.1  Ranking  Features  in  the  Input  Space 

In [13] a gradient descent method is proposed to rank the input features
by minimizing the bound on the expectation of the leave-one-out error of
the classifier. The basic idea is to re-scale the $n$-dimensional input
space by an $n \times n$ diagonal matrix $\sigma$ such that $R^2/M^2$ is
minimized. The new mapping function is then
$\Phi_\sigma(x) = \Phi(\sigma \cdot x)$ and the kernel function is
$K_\sigma(x, y) = K(\sigma \cdot x, \sigma \cdot y) = (\Phi_\sigma(x) \cdot \Phi_\sigma(y))$.
The decision function given in Eq. (1) becomes:

$$f(x, \sigma) = w \cdot \Phi_\sigma(x) + b = \sum_{i=1}^{\ell} \alpha_i^0 y_i K_\sigma(x_i, x) + b \tag{6}$$

The maximization problem of Eq. (2) is now given by:

$$W^2(\alpha, \sigma) = 2 \sum_{i=1}^{\ell} \alpha_i - \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j K_\sigma(x_i, x_j) \tag{7}$$

subject to constraints $\sum_{i=1}^{\ell} \alpha_i y_i = 0$ and
$C \ge \alpha_i \ge 0$, $i = 1, \ldots, \ell$. We solve for $\alpha$ and
$\sigma$ using an iterative procedure: we initialize $\sigma$ as a vector
of ones and then solve Eq. (7) for the margin and the corresponding
problem for the radius. Using the values for $\alpha$ and $\beta$ (the
coefficients of the radius problem) from these solutions and the bound in
Eq. (3), we compute $\sigma$ by minimizing
$W^2(\alpha, \sigma) R^2(\beta, \sigma)$ using a gradient descent
procedure. We then start a new iteration of the algorithm using the
$\sigma$ of the current iteration as initialization.

We applied the ranking method to 283 gray features generated by
preprocessing 19x19 image patterns as described in [10]. Additionally we
performed tests with PCA gray features that we obtained by projecting the
data points into the 283-dimensional eigenvector space. PCA was computed
on the combined positive and negative sets. We computed one iteration of
the algorithm in all of our experiments. The tests were performed on the
small test set for 60, 80 and 100 ranked features. The ROC curves for
second-degree polynomial SVMs are shown in Fig. 1. For 100 features there
is no difference between gray and PCA gray features. For 80 and 60
features, however, the PCA gray features gave clearly better results. For
this reason we focused in the remainder of the paper on PCA features only.
An interesting observation is that the ranking of the PCA features
obtained by the gradient descent method described above was similar to the
ranking by decreasing eigenvalues.
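
The re-scaled kernel and the use of the scaling factors for ranking can be sketched as follows. This is only an illustration under stated assumptions: the gradient descent step that produces $\sigma$ is not shown, the function names are hypothetical, and ordering features from the largest to the smallest $|\sigma_i|$ is our assumption about how the scaling factors translate into a ranking.

```python
def scaled_poly2_kernel(x, y, sigma):
    """K_sigma(x, y) = K(sigma * x, sigma * y) with K(u, v) = (1 + u . v)^2.

    sigma holds the diagonal of the scaling matrix, one factor per feature.
    """
    dot = sum((s * a) * (s * b) for s, a, b in zip(sigma, x, y))
    return (1.0 + dot) ** 2


def rank_features(sigma):
    """Feature indices ordered from largest to smallest |sigma_i| (assumed)."""
    return sorted(range(len(sigma)), key=lambda i: abs(sigma[i]), reverse=True)


def keep_top_features(x, ranking, n_star):
    """Keep the n_star best-ranked components of x, in ranked order."""
    return [x[i] for i in ranking[:n_star]]
```

With `sigma` set to all ones the scaled kernel reduces to the ordinary second-degree polynomial kernel, which matches the initialization described above; a factor of zero removes the corresponding feature from the kernel entirely.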


[Figure 1 plots. Training: 2,429 faces / 4,548 non-faces. Test: reduced
CMU set / 118 images / 472 faces / 23,573 windows.]

Figure 1. ROC curves for gray (top) and PCA gray (bottom) features with
the 60, 80 and 100 best ranked features.


3.2  Selecting  Features  in  the  Input  Space 

In Section 3.1 we ranked the features according to their scaling factors
$\sigma_i$. Now the problem is to determine a subset of the ranked
features $(x_1, x_2, \ldots, x_n)$. This problem can be formulated as
finding the optimal subset of ranked features
$(x_1, x_2, \ldots, x_{n^*})$ among the $n$ possible subsets, where $n^*$
is the number of selected features. As a measure of the classification
performance of an SVM for a given subset of ranked features we again used
the bound on the expected error probability:

$$EP_{err} \le \frac{1}{\ell} E\left\{R^2 W^2(\alpha^0)\right\} \tag{8}$$
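
One simple way to use Eq. (8) for comparing candidate values of $n^*$ is sketched below. It is an illustration, not the procedure used in the experiments: the expectation is replaced by the value from a single training run, the $(R^2, W^2)$ measurements are assumed to come from an SVM trained on the $n^*$ top-ranked features, and picking the minimum is our simplification (in practice one may prefer to inspect the whole curve). All names are hypothetical.

```python
def bound_estimate(radius_sq, w_sq, num_train):
    """Eq. (8) with the expectation replaced by a single training run."""
    return radius_sq * w_sq / num_train


def best_prefix_size(measurements, num_train):
    """Pick the n_star with the smallest bound estimate.

    measurements maps n_star -> (R^2, W^2), measured for an SVM trained on
    the n_star top-ranked features (values supplied by the caller).
    """
    scores = {n_star: bound_estimate(r2, w2, num_train)
              for n_star, (r2, w2) in measurements.items()}
    best = min(scores, key=scores.get)
    return best, scores
```

The function returns all scores alongside the minimizer so that flat regions of the bound can be examined rather than relying on the single smallest value.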

To simplify the computation of our algorithm and to avoid solving a
quadratic optimization problem in order to compute the radius $R$, we
approximated^2 $R^2$ by $2p$ where $p$ is the dimension of the feature
space $\mathbb{R}^p$. For a second-degree polynomial kernel of type
$(1 + x \cdot y)^2$ we get:

$$EP_{err} \le \frac{1}{\ell}\, 2p\, E\left\{W^2(\alpha^0)\right\} \le \frac{1}{\ell}\, n^*(n^* + 3)\, E\left\{W^2(\alpha^0)\right\} \tag{9}$$

where $n^*$ is the number of selected features^3. The estimated bound on
the expected error is shown in Fig. 2. We had no training error for more
than 22 selected features. The estimated bound on the expected error shows
a plateau between 30 and 60 features, then it increases steadily.

The bound in Eq. (8) is considered to be a loose bound on the expected
error. To check if the bound is of practical use for selecting the number
of features we performed tests on the CMU test set. In Fig. 3 we compare
the ROC curves obtained for different numbers of selected features. The
results show that using more than 60 features does not improve the
performance of the system. This observation coincides with the run of the
curve in Fig. 2. However, the error on the test set does not change
significantly for more than 70 features although the estimated bound on
the expected error shown in Fig. 2 increases. This is probably because our
bound gets looser with increasing number of features through the
approximation of $R$ by the dimensionality of the feature space.

^2 We previously normalized all the data in $\mathbb{R}^n$ to be in a
range between 0 and 1. As a result the points lay within a
$p$-dimensional cube of length $\sqrt{2}$ in $\mathbb{R}^p$ and the
radius of the smallest sphere including all the data points is upper
bounded by $\sqrt{2p}$.

^3 As we used a second-degree polynomial SVM the dimension of the feature
space is $p = n^*(n^* + 3)/2$.

[Figure 2 plot. Title: Estimation of Bound on Generalization Error.]

Figure 2. Approximation of estimated bound on the expected error versus
number of principal components. The values on the y-axis are not
normalized by the number of training samples.

[Figure 3 plot. Training: 2,429 faces / 4,548 non-faces. Test: CMU set 1 /
118 images / 479 faces / 56,774,966 windows.]

Figure 3. ROC curves for different numbers of PCA gray features.

4 Feature Reduction in the Feature Space

In the previous section we described how to reduce the number of features
in the input space. Now we consider the problem of reducing the number of
features in the feature space. We use a new method based on the
contribution of the features from the feature space to the decision
function $f(x)$ of the SVM.

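$$f(x) = w \cdot \Phi(x) + b = \sum_{i=1}^{\ell} \alpha_i^0 y_i K(x_i, x) + b \tag{10}$$

with $w = (w_1, \ldots, w_p)$. For a second-degree polynomial kernel with
$K(x, y) = (1 + x \cdot y)^2$, the feature space $\mathbb{R}^p$ with
dimension $p = \frac{(n+3)n}{2}$ is given by

$$x^* = (\sqrt{2}x_1, \ldots, \sqrt{2}x_n, x_1^2, \ldots, x_n^2, \sqrt{2}x_1 x_2, \ldots, \sqrt{2}x_{n-1} x_n).$$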
The contribution of a feature to the decision function in Eq. (10)
depends on $w_k$. A straightforward way to order the features is by
ranking $|w_k|$. Alternatively, we weighted $w$ by the support vectors to
account for different distributions of the features in the training data.
The features were ordered by ranking $|w_k \sum_i y_i x^*_{i,k}|$, where
$x^*_{i,k}$ denotes the $k$-th component of support vector $i$ in feature
space $\mathbb{R}^p$. For both methods we first trained an SVM with a
second-degree polynomial kernel on 60 PCA gray features of the input
space, which corresponds to 1891 features in $\mathbb{R}^p$. We then
calculated

$$E(S) = \frac{1}{s} \sum_{i=1}^{s} \left| f(x_i) - f_S(x_i) \right| \tag{11}$$

for all $s$ support vectors, where $f_S(x)$ is the decision function using
the $S$ first features according to their ranking. Note that in contrast
to the previously described selection of features in the input space this
method does not require the retraining of SVMs for different feature sets.
The results in Fig. 4 show that ranking by the weighted components of $w$
leads to a faster convergence of $E(S)$ from Eq. (11) towards 0. Fig. 5
shows the ROC curves for 500 and 1000 features. As a reference we added
the ROC curve for a second-degree SVM trained on the original 283 gray
features. This corresponds to $\frac{(283+3) \cdot 283}{2} = 40{,}469$
components in the feature space. By combining both methods of feature
reduction we could reduce the dimensionality by a factor of about 40
without loss in performance.

[Figure 4 plot. Title: Partial Sum for Support Vectors. x-axis: Number of
selected features.]

Figure 4. Classifying support vectors with a reduced number of features.
The x-axis shows the number of features, the y-axis is the mean absolute
difference between the output of the SVM using all features and the same
SVM using the S first features only. The features were ranked according to
the components and the weighted components of the normal vector of the
separating hyperplane.
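
The ranking in the feature space and the quantity plotted in Fig. 4 can be sketched in a few lines. This is a minimal illustration with hypothetical names, not the experimental code: `sv_features[i]` is assumed to hold support vector $i$ already mapped into the feature space, `labels[i]` its class label, and the weighted score $|w_k \sum_i y_i x^*_{i,k}|$ is our reading of "weighting $w$ by the support vectors".

```python
def rank_by_weight(w):
    """Order feature-space components by decreasing |w_k|."""
    return sorted(range(len(w)), key=lambda k: abs(w[k]), reverse=True)


def rank_by_weighted_components(w, sv_features, labels):
    """Order components by decreasing |w_k * sum_i y_i x*_{i,k}| (assumed form)."""
    scores = []
    for k in range(len(w)):
        signed_sum = sum(y * phi[k] for phi, y in zip(sv_features, labels))
        scores.append(abs(w[k] * signed_sum))
    return sorted(range(len(w)), key=lambda k: scores[k], reverse=True)


def partial_decision(phi, w, b, ranking, num_features):
    """Decision function restricted to the num_features best-ranked components."""
    return sum(w[k] * phi[k] for k in ranking[:num_features]) + b


def mean_abs_difference(sv_features, w, b, ranking, num_features):
    """Mean |f(x_i) - f_S(x_i)| over the support vectors, with S = num_features."""
    diffs = [abs(partial_decision(phi, w, b, ranking, len(w))
                 - partial_decision(phi, w, b, ranking, num_features))
             for phi in sv_features]
    return sum(diffs) / len(diffs)
```

Because only the already trained weight vector `w` is truncated, the curve can be evaluated for every value of `num_features` without retraining, which is the property the text points out.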

5  Hierarchy  of  classifiers 

5.1  System  Overview 

[Figure 5 plot. Training: 2,429 faces / 4,548 non-faces. Test: CMU set 1 /
118 images / 479 faces / 56,774,966 windows.]

Figure 5. ROC curves for different dimensions of the feature space.

In most object detection problems the majority of analyzed image patches
belong to the background class. Only a small percentage of these patches
look similar to objects and require a highly accurate classifier to avoid
false classifications. For this reason we developed a 3-level hierarchy of
classifiers where the computational complexity of the classifiers
increases with each level. By propagating only those patterns that were
classified as faces, we quickly decrease the amount of data when going up
the hierarchy. The bottom level of our hierarchy consisted of a linear SVM
that was trained on 9x9 face images. On the second level we increased the
image resolution by a factor of two (19x19 face patterns) but kept the
linear kernel. On the third level we finally used our best classifier, a
non-linear SVM with a second-degree polynomial kernel that was trained on
19x19 images. This classifier is highly sensitive to translation. If a
face is not centered in the classification window it is likely to be
classified as a non-face. In order not to miss any faces we search for
faces in a small neighborhood around each detection location that was
propagated from the second level. This means that we analyze 16 patterns
on level three for each pattern that was classified as a face on level
two. Fig. 9 gives an impression of the performance of the three individual
classifiers; shown are the detection results for images from the CMU test
set 1.
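
The data flow through the three levels can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the three classifiers are passed in as arbitrary callables that return True for "face", the 4x4 grid of offsets is a hypothetical choice for the 16-pattern neighborhood (the text does not specify its layout), and removing duplicate level-three candidates is our design choice.

```python
def neighborhood_4x4(location):
    """Hypothetical 16-position neighborhood around a level-two detection."""
    x, y = location
    return [(x + dx, y + dy) for dy in (-2, -1, 0, 1) for dx in (-2, -1, 0, 1)]


def run_hierarchy(locations, level1, level2, level3,
                  neighborhood=neighborhood_4x4):
    """Propagate window locations through a 3-level hierarchy.

    level1, level2, level3 are callables taking a location and returning
    True if the pattern at that location is classified as a face.
    Neighborhood positions are not clipped to the image here.
    """
    locations = list(locations)
    passed1 = [loc for loc in locations if level1(loc)]
    passed2 = [loc for loc in passed1 if level2(loc)]
    # Level three re-examines a small neighborhood of every survivor.
    candidates = sorted({nb for loc in passed2 for nb in neighborhood(loc)})
    detections = [loc for loc in candidates if level3(loc)]
    counts = {
        "input": len(locations),
        "after_level1": len(passed1),
        "after_level2": len(passed2),
        "level3_candidates": len(candidates),
        "detections": len(detections),
    }
    return detections, counts
```

The returned counts make the reduction at each level visible; the expensive third classifier only sees the neighborhoods of patterns that survived the two linear levels.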

5.2  Experiments 

All three classifiers of our hierarchical system were trained on the same
training set of 2,429 faces and 4,548 randomly selected non-face patterns.
To train the low-resolution classifier in the first layer we downscaled
the images from 19x19 to 9x9 pixels. The classifiers in the first two
layers were trained on the gray value features described in Section 2.3.
The third classifier was trained on PCA gray features which were
determined by the feature selection techniques described in Sections 3 and
4 (1,000 features in the feature space determined from 60 PCA gray
features of the input space). The ROC curves of the individual classifiers
are shown in Fig. 6 for CMU test set 1. In Fig. 8 we show the data flow
through the hierarchy for the CMU test set 1. The first classifier removes
more than 90% of the background. The final classifier is the most
selective: it classifies more than 99% of its input patterns as non-faces.
In Fig. 7 we compare the ROC curve of the 3-level system with the ROC
curve of the original single SVM classifier with second-degree polynomial
kernel. The performances are similar. The average computing time for a
320x240 image is shown in Table 1. We achieved a speed-up by a factor of
170 compared to the original system.
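
The speed-up factors follow directly from the detection times reported in Table 1 (271 s, 63.8 s and 1.6 s). As a short worked check, where the last line (the share attributable to the hierarchy alone) is our own derived figure rather than one stated in the table:

```latex
\begin{align*}
\text{feature reduction alone:} \quad
  & \frac{271\,\mathrm{s}}{63.8\,\mathrm{s}} \approx 4.25 \\
\text{hierarchy with feature reduction:} \quad
  & \frac{271\,\mathrm{s}}{1.6\,\mathrm{s}} \approx 169.4 \approx 170 \\
\text{hierarchy on top of feature reduction:} \quad
  & \frac{63.8\,\mathrm{s}}{1.6\,\mathrm{s}} \approx 39.9
\end{align*}
```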


[Figure 6 plot. Training: 2,429 faces / 4,548 non-faces. Test: CMU set 1 /
118 images / 479 faces / 56,774,966 windows.]

Figure 6. ROC curves for the three classifiers of the hierarchical system
for the CMU test set 1.

[Figure 7 plot. Training: 2,429 faces / 4,548 non-faces. Test: CMU set 1 /
118 images / 479 faces / 56,774,966 windows.]

Figure 7. ROC curve of the 3-level hierarchical system and the original
system for the CMU test set 1.


6  Conclusion  and  Future  Work 

In this paper we presented speed-up methods for object detection systems
based on feature reduction and hierarchical classification. The feature
reduction was done by ranking and then selecting PCA gray features
according to a classification criterion that was derived from learning
theory. Applied to a face detection system we could remove 99% of the
original features without loss in classification performance. To quickly
remove large background parts of an image we arranged three classifiers
with increasing computational complexity in a hierarchical structure.
Experiments with a face detection system show that the combination of
feature selection and hierarchical classification speeds up the system by
a factor of 170 while maintaining the classification accuracy. In future
work we will run experiments on larger training sets, apply feature
reduction to all levels of the hierarchical classifier, and explore ways
of finding the optimal number of levels. We will also perform experiments
with hierarchical training where each classifier is trained on the outputs
of the classifier of the previous level.

Acknowledgements 

The authors would like to thank S. Prentice for helping with the
experiments. The research was partially sponsored by DARPA under contract
No. N00014-00-1-0907, and National Science Foundation under contract
No. IIS-9800032. Additional support was provided by the DFG, Eastman
Kodak, and Compaq.

References 

[1] A. Blum and P. Langley. Selection of relevant features and examples in
machine learning. Artificial Intelligence, 10:245-271, 1997.

[2] P. J. Burt. Smart sensing within a pyramid vision machine. Proc. IEEE,
76(8):1006-1015, 1988.

[3] B. Heisele, T. Poggio, and M. Pontil. Face detection in still gray
images. A.I. memo 1687, Center for Biological and Computational Learning,
MIT, Cambridge, MA, 2000.

[4] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual
attention for rapid scene analysis. IEEE Trans. Pattern Analysis and
Machine Intelligence, 20(11):1254-1259, 1998.

[5] R. Kohavi. Wrappers for feature subset selection. Artificial
Intelligence, special issue on relevance, 97:273-324, 1995.

[6] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio.
Pedestrian detection using wavelet templates. In IEEE Conference on
Computer Vision and Pattern Recognition, pages 193-199, San Juan, 1997.

[7] S. Romdhani, P. Torr, B. Schoelkopf, and A. Blake. Computationally
efficient face detection. Proc. ICCV, 11:695-700, 2001.

[8] A. Rosenfeld and G. J. Vanderbrug. Coarse-fine template matching. IEEE
Transactions on Systems, Man and Cybernetics, 2:104-107, 1977.

[9] H. A. Rowley, S. Baluja, and T. Kanade. Rotation invariant neural
network-based face detection. Computer Science Technical Report
CMU-CS-97-201, CMU, Pittsburgh, 1997.

[10] K.-K. Sung. Learning and Example Selection for Object and Pattern
Recognition. PhD thesis, MIT, Artificial Intelligence Laboratory and
Center for Biological and Computational Learning, Cambridge, MA, 1996.

[11] V. Vapnik. Statistical learning theory. John Wiley and Sons, New
York, 1998.

[12] P. Viola and M. Jones. Robust real-time face detection. Proc. ICCV,
20(11):1254-1259, 2001.

[13] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and
V. Vapnik. Feature selection for support vector machines. In Advances in
Neural Information Processing Systems 13, 2001.


[Figure 8 diagram. Label: Background. * Percentage relative to the number
of input patterns of the classifier.]

Figure 8. Data flow for the 3-level hierarchy of classifiers determined on
the CMU test set 1.


System                                                   Typical detection time   Speed-up factor
Single 2nd-degree polynomial SVM                         271 s                    -
Single 2nd-degree polynomial SVM + feature reduction     63.8 s                   4.25
3-level hierarchy + feature reduction                    1.6 s                    170

Table 1. Computing time for a 320x240 image processed on a dual Pentium
III with 733 MHz. The original image was rescaled in 5 steps to detect
faces at resolutions between 26x26 and 60x60 pixels.

Figure 9. Detections at each level of the hierarchy.


